ProtoP-OD: Explainable Object Detection with Prototypical Parts
Rath-Manakidis, Pavlos, Strothmann, Frederik, Glasmachers, Tobias, Wiskott, Laurenz
Interpretation and visualization of the behavior of detection transformers tend to highlight the locations in the image that the model attends to, but they provide limited insight into the semantics that the model is focusing on. This paper introduces an extension to detection transformers that constructs prototypical local features and uses them in object detection. These custom features, which we call prototypical parts, are designed to be mutually exclusive and to align with the classifications of the model. The proposed extension consists of a bottleneck module, the prototype neck, that computes a discretized representation of prototype activations, and a new loss term that matches prototypes to object classes. This setup leads to interpretable representations in the prototype neck, allowing visual inspection of the image content perceived by the model and a better understanding of the model's reliability. We show experimentally that our method incurs only a limited performance penalty, and we provide examples that demonstrate the quality of the explanations provided by our method, which we argue outweighs the performance penalty.
Plan-Recognition-Driven Attention Modeling for Visual Recognition
Zha, Yantian, Li, Yikang, Yu, Tianshu, Kambhampati, Subbarao, Li, Baoxin
Human visual recognition of activities or external agents involves an interplay between high-level plan recognition and low-level perception. Given that, a natural question to ask is: can low-level perception be improved by high-level plan recognition? We formulate the problem of leveraging recognized plans to generate better top-down attention maps \cite{gazzaniga2009,baluch2011} that improve perception performance. We call these top-down attention maps plan-recognition-driven attention maps. To address this problem, we introduce the Pixel Dynamics Network. The Pixel Dynamics Network serves as an observation model, which predicts the next states of object points at each pixel location given observations of pixels and pixel-level action features. This is like internally learning a pixel-level dynamics model. The Pixel Dynamics Network is a kind of Convolutional Neural Network (ConvNet) with a specially designed architecture. It can therefore take advantage of the parallel computation of ConvNets while learning the pixel-level dynamics model. We further prove the equivalence between the Pixel Dynamics Network as an observation model and the belief update in the partially observable Markov decision process (POMDP) framework. We evaluate our Pixel Dynamics Network on event recognition tasks. We build an event recognition system, ER-PRN, which takes the Pixel Dynamics Network as a subroutine, to recognize events based on observations augmented by plan-recognition-driven attention.
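For readers unfamiliar with the POMDP belief update mentioned in the abstract above, the following is a minimal sketch of the standard discrete Bayes-filter form of that update. It is not the Pixel Dynamics Network itself; the function name `belief_update`, the two-state example, and all numeric values are assumptions chosen purely for illustration.

```python
import numpy as np

def belief_update(belief, transition, observation_likelihood):
    """Standard discrete POMDP belief update (illustrative, not from the paper).

    belief[s]                  : current probability of state s
    transition[s, s_next]      : P(s_next | s, a) for the action that was taken
    observation_likelihood[s'] : P(o | s') for the observation that was received
    """
    # Prediction step: push the belief through the transition model.
    predicted = transition.T @ belief
    # Correction step: weight by how well each state explains the observation.
    unnormalized = observation_likelihood * predicted
    # Normalize so the result is again a probability distribution.
    return unnormalized / unnormalized.sum()

# Hypothetical two-state example.
belief = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
observation_likelihood = np.array([0.7, 0.1])

new_belief = belief_update(belief, transition, observation_likelihood)
```

In this toy example the observation is far more likely under the first state, so the updated belief shifts toward that state.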
Learning what to look in chest X-rays with a recurrent visual attention model
Ypsilantis, Petros-Pavlos, Montana, Giovanni
X-rays are commonly performed imaging tests that use small amounts of radiation to produce pictures of the organs, tissues, and bones of the body. X-rays of the chest are used to detect abnormalities or diseases of the airways, blood vessels, bones, heart, and lungs. In this work we present a stochastic attention-based model that is capable of learning what regions within a chest X-ray scan should be visually explored in order to conclude that the scan contains a specific radiological abnormality. The proposed model is a recurrent neural network (RNN) that learns to sequentially sample the entire X-ray and focus only on informative areas that are likely to contain the relevant information. We report on experiments carried out with more than 100,000 X-rays containing enlarged hearts or medical devices. The model has been trained using reinforcement learning methods to learn task-specific policies.
Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
In this work we use Branch-and-Bound (BB) to efficiently detect objects with deformable part models. Instead of evaluating the classifier score exhaustively over image locations and scales, we use BB to focus on promising image locations. The core problem is to compute bounds that accommodate part deformations; for this we adapt the Dual Trees data structure to our problem. We evaluate our approach using Mixture-of-Deformable Part Models. We obtain exactly the same results but are 10-20 times faster on average. We also develop a multiple-object detection variation of the system, where hypotheses for 20 categories are inserted in a common priority queue. For the problem of finding the strongest category in an image this results in up to a 100-fold speedup.
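The idea of replacing exhaustive score evaluation with a bound-driven priority queue can be illustrated with a minimal sketch. This is a generic best-first branch-and-bound over a 1-D range of integer locations, not the Dual-Tree bounds of the paper: the bound used here assumes the score is Lipschitz-continuous, and the function `branch_and_bound_argmax`, the toy `score`, and the constant `0.08` are all hypothetical.

```python
import heapq
import math

def branch_and_bound_argmax(score, lo, hi, lipschitz):
    """Best-first search for the integer x in [lo, hi) that maximizes score(x).

    Assumes |score(x) - score(y)| <= lipschitz * |x - y| (an illustrative
    assumption; the paper derives its bounds differently).
    """
    def upper_bound(a, b):
        # Evaluate the center and add the largest possible increase
        # anywhere else in the interval [a, b).
        mid = (a + b - 1) // 2
        radius = max(mid - a, b - 1 - mid)
        return score(mid) + lipschitz * radius

    # heapq is a min-heap, so bounds are stored negated.
    heap = [(-upper_bound(lo, hi), lo, hi)]
    while heap:
        neg_bound, a, b = heapq.heappop(heap)
        if b - a == 1:
            # For a single location the bound equals the true score, and it
            # is at least as large as every bound still in the queue.
            return a, -neg_bound
        mid = (a + b) // 2
        for c, d in ((a, mid), (mid, b)):
            heapq.heappush(heap, (-upper_bound(c, d), c, d))
    raise ValueError("empty search range")

def score(x):
    # Toy score with several local maxima; its slope never exceeds 0.07.
    return 3.0 * math.cos(x / 50.0) - abs(x - 400) / 100.0

best_x, best_score = branch_and_bound_argmax(score, 0, 1000, lipschitz=0.08)
```

Because intervals are always expanded in order of their upper bound, the first single-location interval popped from the queue is a global maximizer, which is why such a search can return the same result as exhaustive evaluation while skipping unpromising regions.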
Implicitly Constrained Gaussian Process Regression for Monocular Non-Rigid Pose Estimation
Salzmann, Mathieu, Urtasun, Raquel
Estimating 3D pose from monocular images is a highly ambiguous problem. Physical constraints can be exploited to restrict the space of feasible configurations. In this paper we propose an approach to constraining the prediction of a discriminative predictor. We first show that the mean prediction of a Gaussian process implicitly satisfies linear constraints if those constraints are satisfied by the training examples. We then show how, by performing a change of variables, a GP can be forced to satisfy quadratic constraints. As evidenced by the experiments, our method outperforms state-of-the-art approaches on the tasks of rigid and non-rigid pose estimation.
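The first claim of the abstract above can be checked numerically with a minimal sketch. It assumes a zero-mean GP with an RBF kernel shared across output dimensions and a homogeneous linear constraint A y = 0; under these assumptions the mean prediction is a linear combination of the training outputs, so it inherits the constraint. The kernel, the constraint matrix `A`, and the synthetic data are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared Euclidean distances between all rows of a and all rows of b.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * sq / length_scale ** 2)

def gp_mean(x_train, y_train, x_test, noise=1e-2):
    """Mean prediction of a zero-mean GP: k_* (K + noise I)^{-1} Y."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train)
    # Each row of `weights` holds the coefficients of one test prediction
    # as a linear combination of the training outputs.
    weights = np.linalg.solve(K, k_star.T).T
    return weights @ y_train

rng = np.random.default_rng(0)
x_train = rng.normal(size=(30, 2))
x_test = rng.normal(size=(5, 2))

# Hypothetical homogeneous constraint A y = 0 on 3-D outputs.
A = np.array([[1.0, 1.0, -2.0]])
# Two vectors spanning the null space of A, so every training output
# built from them satisfies the constraint.
null_basis = np.array([[1.0, -1.0, 0.0],
                       [1.0, 1.0, 1.0]])
coeffs = np.column_stack([np.sin(x_train[:, 0]), np.cos(x_train[:, 1])])
y_train = coeffs @ null_basis

mean = gp_mean(x_train, y_train, x_test)
# Largest constraint violation over all test predictions.
max_violation = np.abs(mean @ A.T).max()
```

The violation stays at the level of floating-point round-off. Note that an affine constraint A y = b with b nonzero would not carry over automatically in this zero-mean setting, since the prediction weights need not sum to one.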
A Biologically Plausible Model for Rapid Natural Scene Identification
Ghebreab, Sennay, Scholte, Steven, Lamme, Victor, Smeulders, Arnold
Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigate whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measure EEG activity of 32 subjects viewing brief flashes of 800 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derive an across-subject Weibull response model. We use this model to predict the responses to a large set of new scenes and estimate which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicts the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene is visually similar to the observed scene itself. These results suggest that Weibull contrast statistics of natural images contain a considerable amount of scene gist information to warrant rapid identification of natural images.